Abstract

On 1 August 2009, the global Collaboratory for the Study of Earthquake Predictability (CSEP) launched a prospective and comparative earthquake predictability experiment in Italy. The goal of the CSEP-Italy experiment is to test earthquake occurrence hypotheses that have been formalized as probabilistic earthquake forecasts over temporal scales that range from days to years. In the first round of forecast submissions, members of the CSEP-Italy Working Group presented eighteen five-year and ten-year earthquake forecasts to the European CSEP Testing Center at ETH Zurich. We considered the twelve time-independent earthquake forecasts among this set and evaluated them with respect to past seismicity data from two Italian earthquake catalogs. In this article, we present the results of tests that measure the consistency of the forecasts with the past observations. Besides being an evaluation of the submitted time-independent forecasts, this exercise provided insight into a number of important issues in predictability experiments with regard to the specification of the forecasts, the performance of the tests, and the trade-off between the robustness of results and experiment duration. We conclude with suggestions for the design of future earthquake predictability experiments.

Keywords: Probabilistic forecasting, earthquake predictability, hypothesis testing, likelihood.

1 Introduction

On August 1, 2009, a prospective and competitive earthquake predictability experiment began in the region of Italy [Schorlemmer et al. 2010a]. The experiment follows the design proposed by the Regional Earthquake Likelihood Model (RELM) working group in California [Field 2007; Schorlemmer et al. 2007; Schorlemmer and Gerstenberger 2007; Schorlemmer et al. 2010b] and falls under the global umbrella of the Collaboratory for the Study of Earthquake Predictability (CSEP) [Jordan 2006; Zechar et al. 2009]. Eighteen five-year forecasts that express a variety of scientific hypotheses about earthquake occurrence were submitted to the European CSEP Testing Center at ETH Zurich. In this article, we present the results from testing these forecasts retrospectively on seismicity data from two Italian earthquake catalogs.

The rationale for performing these retrospective tests is as follows:

I/ To verify that the submitted forecasts are as intended by the modelers;

II/ To provide a sanity check of the forecasts before the end of the five-year or ten-year experiments;

III/ To provide feedback to each modeler about the performance of her or his model in retrospective tests and to encourage model improvements;

IV/ To better understand the tests and performance metrics;

V/ To have worked, pedagogical examples of plausible observations and results; and

VI/ To understand the relation between the duration of predictability experiments and the robustness of the outcomes.

Nevertheless, retrospective tests also come with significant caveats:

1. We only evaluated time-independent models; to fairly test time-dependent models on past data would require that the model software be installed at the testing center so that hindcasts could be generated. We identify long-term forecasts from time-dependent models in section 2, and we did not analyze these forecasts.

2. Past data may be of lower quality than the data used for prospective testing (e.g., greater uncertainties in magnitude and location, or missing aftershocks, potentially with systematic bias).

3. There are many versions of the past, in the form of several available earthquake catalogs. In an attempt to address this issue, we tested with respect to two catalogs (see section 4).

4. All of the forecasts considered here are in some way based on past observations. For example, parameters of the models typically were optimized on part or all of the data against which we retrospectively tested the models. Therefore, positive retrospective test results might simply reveal that a model can adequately fit the data on which it was calibrated, and they might not be indicative of future performance on independent data. A study beyond the scope of this article would be required to decide which of the retrospective data can be regarded as out-of-sample for each model. On the other hand, poor performance of a time-independent forecast in these retrospective experiments indicates that the forecast cannot adequately explain the available data. Therefore, one aim of this article is to identify forecasts of time-independent models that consistently fail in retrospective tests, thereby separating ineffective time-independent models from potentially good models.

Poor performance of a time-independent forecast might result from one or more of several factors: technical errors (i.e., errors in software implementation), a misunderstanding of the required object to be forecast, calibration on low-quality data, evaluation with low-quality data, statistical type II errors, or incorrect hypotheses of earthquake occurrence. CSEP participants seek to minimize the chances of each of these effects save the final one: that a forecast is rejected because its underlying hypotheses about earthquake occurrence are wrong.



This article is accompanied by an electronic supplement (available online at http://mercalli.ethz.ch/~niwerner/CSEP_ITALY/ESUPP/), where the reader can find additional figures and a table of information gains that aid in the evaluation of the considered forecasts.

2 Overview of the Time-Independent Models 

Each of the forecasts submitted to the five- and ten-year CSEP-Italy experiment can be broadly grouped into one of two classes: those derived from time-independent models, and those derived from time-dependent models (see Table 1). The forecasts in the former class are considered to be suitable for any time translation and they depend only on the length of the forecasting time interval (at least over a reasonable time interval where the models are assumed to be time-independent). Therefore, these forecasts can be tested on different target periods. In contrast, the forecasts in the latter group depend on the initial time of the forecast. Because the recipes for calculating the forecasts (i.e., the model software) were not available to us, we could not generate hindcasts from these models that could be meaningfully evaluated. We therefore did not consider time-dependent models in this study. Below, we provide a brief summary of each time-independent model.

The model Akinci-et-al.Hazgridx contains the assumption that future earthquakes will occur close in space to locations of historical m > 4 mainshocks. No tectonic, geological or geodetic information was used to calculate the forecast. The model is based on the method by Weichert [1980] to estimate the seismic rate from declustered earthquake catalogs whose magnitude completeness threshold varies with time. The forecast uses a Gutenberg-Richter law with a uniform b-value.

Chan-et-al.Hzati considers a specific bandwidth function to smooth past seismicity and to evaluate the spatial seismicity density of earthquakes. The model smooths both spatial locations and magnitudes. The smoothing procedure is applied to a coarse seismotectonic zonation based on large-scale geological structure. The expected rate of earthquakes is obtained from the average historical seismicity rate.

Each Asperity Likelihood Model (ALM), namely Gulia-Wiemer.ALM, Gulia-Wiemer.HALM, and Schorlemmer-Wiemer.ALM, hypothesizes that small-scale spatial variations in the b-value of the Gutenberg-Richter relationship play a central role in forecasting future seismicity [Wiemer and Schorlemmer 2007]. The physical basis of the model is the concept that the local b-value is inversely proportional to applied shear stress. Thus low b-values (b < 0.7) are thought to characterize the locked patches of faults (asperities) from which future mainshocks are more likely to be generated, whereas high b-values (b > 1.1), found for example in creeping sections of faults, suggest a lower probability of large events. The b-value variability is mapped on a grid. The local a- and b-values in the forecasts Gulia-Wiemer.ALM and Gulia-Wiemer.HALM were obtained from the observed rates of declustered earthquakes between 1981 and 2009, using Reasenberg's declustering method [Reasenberg 1985] and the Entire-Magnitude-Range method for completeness estimation by Woessner and Wiemer [2005]. In the Gulia-Wiemer.HALM model (Hybrid ALM), a "hybrid" between a grid-based and a zoning model, the Italian territory is divided into distinct regions depending on the main tectonic regime, and the local b-value variability is thus mapped using independent b-values for each tectonic zone. In the Schorlemmer-Wiemer.ALM model, derived from the original ALM [Wiemer and Schorlemmer 2007], the authors decluster the input catalog (2005-2009) for m > 2 using the method by Gardner and Knopoff [1974] and smooth the node-wise rates of the declustered catalog with a Gaussian filter. Completeness values for each node are taken from the analysis by Schorlemmer et al. [2010] using the probability-based magnitude of completeness method. The resulting forecast is calibrated to the observed average number of events with m > 4.95.

The Meletti-et-al.MPS04 model [Gruppo di lavoro MPS 2004; http://zonesismiche.mi.ingv.it] is the reference model for seismic hazard in Italy. Meletti-et-al.MPS04 derives from the standard approach to probabilistic seismic hazard assessment of Cornell [1968], in which a Poisson process is assumed. The model distributes the seismicity in a seismotectonic zonation and it considers the historical catalog using, through a logic tree structure, two different ways (historical and statistical) to estimate its completeness. The model also assumes that each zone is characterized by its own Gutenberg-Richter law with varying truncation.

The Relative Intensity (RI) model (Nanjo-et-al.RI) is a pattern recognition model based on the main assumption that future large earthquakes tend to occur where the seismic activity had a specific pattern (usually a higher seismicity) in the past. In its first version, the RI code was "alarm-based," i.e., the code made a binary statement about the occurrence of earthquakes. For the CSEP-Italy experiment, the code was modified to estimate the expected number of earthquakes in a specific time-space-magnitude bin.

The models Werner-et-al.CSI and Werner-et-al.Hybrid are based on smoothed seismicity. Future earthquakes are assumed to occur with higher probability in areas where past earthquakes have occurred. Locations of past mainshocks are smoothed using an adaptive power-law kernel, i.e., little in regions of dense seismicity and more in sparse regions. The degree of smoothing is optimized via retrospective tests. The magnitude of each earthquake is independently distributed according to a tapered Gutenberg-Richter distribution with corner magnitude 8.0. The model uses small magnitude m > 2.95 quakes, whenever trustworthy, to better forecast future large events. The two forecasts Werner-et-al.CSI and Werner-et-al.Hybrid were obtained by calibrating the model on two different earthquake catalogs.

The forecasts Zechar-Jordan.CPTI, Zechar-Jordan.CSI and Zechar-Jordan.Hybrid are derived from the Simple Smoothed Seismicity (Triple-S) model, which is based on Gaussian smoothing of past seismicity. Past epicenters make a smoothed contribution to an earthquake density estimation, where the epicenters are smoothed using a fixed lengthscale σ; σ is optimized by minimizing the average area skill score misfit function in a retrospective experiment [Zechar and Jordan 2010a]. The density map is scaled to match the average historical rate of seismicity. The two forecasts Zechar-Jordan.CPTI and Zechar-Jordan.CSI were optimized for two different catalogs, while Zechar-Jordan.Hybrid is a hybrid forecast.

3 Specification of CSEP-Italy Forecasts

We use the term "seismicity model" to mean a system of hypotheses and inferences that is presented as a mathematical, numerical and simplified description of the process of seismicity. A "seismicity forecast" is a statement about some observable aspect of seismicity that derives from a seismicity model. In the context of the CSEP-Italy experiment, a seismicity forecast is a set of estimates of the expected number of future earthquakes in each bin, where bins are specified by intervals of location, time and magnitude within the multi-dimensional testing volume [see also Schorlemmer et al. 2007]. More precisely, the CSEP-Italy participants agreed (within the official "Rules of the Game" document) to provide a numerical estimate of the likelihood distribution of observing any number of earthquakes within each bin. Moreover, this discrete distribution, which specifies the probability of observing zero, one, two, etc. earthquakes in a bin, is given by a Poisson distribution (defined below in section 4.3), which is uniquely defined by the expected number of earthquakes. Each bin's distribution is assumed independent of the distribution in other bins, and the observed number of earthquakes in a given bin is compared with the forecast of that bin.
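To make the forecast specification concrete, the following minimal sketch represents a forecast as a mapping from bins to expected counts and evaluates the joint log-likelihood of an observed catalog under the independent-Poisson assumption. The bin keys, rates, and function names are hypothetical illustrations and are not taken from the CSEP software.

```python
import math

def poisson_log_pmf(n, lam):
    """Log-probability of n events under a Poisson distribution with rate lam."""
    if lam == 0.0:
        return 0.0 if n == 0 else float("-inf")
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def joint_log_likelihood(forecast, observed):
    """Sum the per-bin log-probabilities, assuming bins are independent."""
    return sum(poisson_log_pmf(observed.get(b, 0), lam)
               for b, lam in forecast.items())

# Hypothetical forecast: expected counts keyed by (longitude, latitude, magnitude) bin.
forecast = {
    (13.0, 42.0, 5.0): 0.02,
    (13.1, 42.0, 5.0): 0.01,
    (13.0, 42.1, 5.0): 0.005,
}
# Hypothetical observation: one earthquake in the first bin, none elsewhere.
observed = {(13.0, 42.0, 5.0): 1}

expected_total = sum(forecast.values())
log_l = joint_log_likelihood(forecast, observed)
print(expected_total, log_l)
```

Because the bins are treated as independent, the joint log-likelihood is a plain sum over bins; this is the quantity on which the likelihood-based tests in section 5 operate.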



4 Data Used For Retrospective Testing 

For prospective tests of the submitted forecasts, the Italian Seismic Bulletin (Bollettino Sismico Italiano, BSI) recorded by INGV will be used [see Schorlemmer et al. 2010a]. We did not use the BSI for a retrospective evaluation of forecasts because it has been available in its current form only since April 2005. Instead, we used two alternative Italian earthquake catalogs provided by the INGV, which were also provided as a tool for the modelers for model learning and calibration: the Catalogo Parametrico dei Terremoti Italiani (Parametric Catalog of Italian Earthquakes, CPTI08) [Rovida and the CPTI Working Group 2008] and the Catalogo della Sismicità Italiana (Catalog of Italian Seismicity, CSI 1.1) [Castello et al. 2007; Chiarabba et al. 2005]. Schorlemmer et al. [2010a] discuss the catalogs in detail; we provide only a brief overview. Both data sets are available for download from http://www.cseptesting.org/regions/Italy.



4.1 The CSI 1.1 Catalog 1981-2002 

The CSI catalog spans the time period from 1981 until 2002 and reports local magnitudes, in agreement with the BSI magnitudes that will be used during the prospective evaluation of forecasts. Schorlemmer et al. [2010a] found a clear change in earthquake numbers per year in 1984 due to the numerous network changes in the early 1980s and therefore recommend using the CSI data from 1 July 1984 onwards. For the retrospective evaluation, we selected earthquakes with local magnitudes $M_L > 4.95$ from 1985 until the end of 2002. To mimic the durations of the prospective experiments, we selected three non-overlapping five-year periods (1998-2002, 1993-1997, 1988-1992). To test the robustness of the results, we also used the entire 18-year span of reliable data from 1985 until 2002. We selected shocks as test data if they occurred within the CSEP-Italy testing region [see Schorlemmer et al. 2010a].

4.2 The CPTI08 Catalog 1901-2006



The CPTI catalog covers the period from 1901 until 2006 and is based on both instrumental and historical observations (for details, see Schorlemmer et al. [2010a]). The catalog lists moment magnitudes that were estimated either from macroseismic data or calculated using a linear regression relationship between surface wave, body wave or local magnitudes. Because the prospective experiment will use local magnitudes, we converted the moment magnitudes to local magnitudes using the same regression equation that was used to convert the original local magnitudes to moment magnitudes for the creation of the CPTI catalog [Gruppo di lavoro MPS 2004; Schorlemmer et al. 2010a]:

$$ M_L = 1.231\,(M_W - 1.145) \tag{1} $$
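The conversion in equation (1) can be sketched as a pair of small helper functions. The coefficient 1.231 is our reading of the partly illegible printed value, and the function names are hypothetical.

```python
def mw_to_ml(mw, slope=1.231, offset=1.145):
    """Convert moment magnitude to local magnitude: M_L = slope * (M_W - offset)."""
    return slope * (mw - offset)

def ml_to_mw(ml, slope=1.231, offset=1.145):
    """Inverse relation: M_W = M_L / slope + offset."""
    return ml / slope + offset

# Moment magnitude that corresponds to the M_L = 4.95 threshold used for test data.
mw_threshold = ml_to_mw(4.95)
print(mw_to_ml(5.0), mw_threshold)
```

Under this reading, a catalog event enters the test data when its converted local magnitude exceeds the 4.95 threshold.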



Schorlemmer et al. [2010a] estimated a conservative completeness magnitude of $M_L = 4.5$, so that one could justify using the entire period from 1901 until 2006 for the retrospective evaluation. However, we focused mainly on the data since the 1950s because it seems to be of higher quality [Schorlemmer et al. 2010a]. We divided the period into non-overlapping ten-year periods to mimic the duration of the prospective experiment, but we also evaluated the forecasts on a 57-year time span from 1950 until 2006 and on the 106-year period from 1901 until 2006. As for the CSI catalog, we only selected shocks within the testing region. Some quakes, mostly during the early part of the CPTI catalog, were not assigned depths. We included these earthquakes as observations within the testing region because it is very unlikely that they were deeper than 30 km (see also Schorlemmer et al. [2010a]).



4.3 The Distribution of the Number of Earthquakes 

In this section, we consider the distribution of the number of observed events in the five- and ten-year periods relevant for the Italian forecasts. Analysis of this empirical distribution can test the assumption (made by all time-independent forecasts) that the Poisson distribution approximates well the observed variation in the number of events in each cell and in the entire testing region. (CSEP-Italy participants decided to forecast all earthquakes and not only so-called mainshocks; see section 4.4.) The Poisson distribution is defined by its discrete probability mass function:



$$ p(n \mid \lambda) = \frac{\lambda^{n} \exp(-\lambda)}{n!} \tag{2} $$

where $n = 0, 1, 2, \ldots$ and $\lambda$ is the rate parameter, the only parameter needed to define the distribution. The expected value and the variance $\sigma^2_{\mathrm{Poi}}$ of the Poisson distribution are both equal to $\lambda$.

Because the span of time over which the CSI catalog is reliable is so short, we used the CPTI catalog for the seismicity rate analysis. The sample variance of the distribution of the number of observed earthquakes in the CPTI catalog over non-overlapping five-year periods from 1907 until 2006 (inclusive) is $\sigma^2_{5yr} = 23.73$, while the sample mean is $\mu_{5yr} = 8.55$. For non-overlapping ten-year periods of the CPTI catalog, the sample variance is $\sigma^2_{10yr} = 64.54$, while the sample mean is $\mu_{10yr} = 17.10$. Because the sample variance is so much larger than the sample mean on the five- and ten-year timescales, it is clear that the seismicity rate varies more widely than expected by a Poisson distribution.
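The comparison of sample variance and sample mean can be summarized by their ratio, often called the dispersion index, which equals one for a Poisson distribution. The sketch below evaluates this ratio for the values reported above and for a list of counts; the counts are hypothetical placeholders and are not the CPTI data.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Ratio of sample variance to sample mean; close to 1 for Poisson counts."""
    return variance(counts) / mean(counts)

# (sample mean, sample variance) as reported in the text for the CPTI catalog.
reported = {"5yr": (8.55, 23.73), "10yr": (17.10, 64.54)}
reported_ratios = {period: var / mu for period, (mu, var) in reported.items()}

# Hypothetical earthquake counts in consecutive windows, for illustration only.
hypothetical_counts = [3, 7, 12, 5, 20, 8]
ratio = dispersion_index(hypothetical_counts)

print(reported_ratios, ratio)
```

A ratio well above one, as obtained for both reported timescales, indicates overdispersion relative to the Poisson assumption.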

Figure 1 shows the number of observed earthquakes in each of the twenty non-overlapping five-year intervals, along with the empirical cumulative distribution function. The Poisson distribution with $\lambda = \mu_{5yr} = 8.55$ and its 95% confidence bounds are also shown. One should expect that one in twenty data points falls outside the 95% confidence interval, but we observe four, and one of these lies outside the 99.99% quantile.

We compared the goodness of fit of the Poisson distribution with that of a negative binomial distribution (NBD), motivated by studies that suggest its use based on empirical and theoretical considerations [Vere-Jones 1970; Kagan 1973; Jackson and Kagan 1999; Kagan 2010; Schorlemmer et al. 2010b; Werner et al. 2010a]. The discrete negative binomial probability mass function is



$$ p(n \mid \tau, \nu) = \frac{\Gamma(\tau + n)}{\Gamma(\tau)\, n!}\, \nu^{\tau} (1 - \nu)^{n} \tag{3} $$

where $n = 0, 1, 2, \ldots$, $\Gamma$ is the gamma function, $0 \le \nu \le 1$, and $\tau > 0$. The average of the NBD is given by

$$ \mu_{\mathrm{NBD}} = \tau\, \frac{1 - \nu}{\nu} \tag{4} $$

while the variance is given by

$$ \sigma^2_{\mathrm{NBD}} = \tau\, \frac{1 - \nu}{\nu^{2}} \tag{5} $$

implying that $\sigma^2_{\mathrm{NBD}} \ge \sigma^2_{\mathrm{Poi}}$. Kagan [2010] discusses different parameterizations of the NBD. For simplicity, we used the above parameterization and maximum likelihood parameter value estimation. We found $\tau_{5yr} = 6.49$ and $\nu_{5yr} = 0.43$, with 95% confidence bounds given by [−0.39, 13.37] and [0.17, 0.70], respectively. The large uncertainties reflect the small sample size of twenty. For the ten-year intervals, we estimated $\tau_{10yr} = 9.24$ and $\nu_{10yr} = 0.35$, with 95% confidence bounds given by [−2.74, 21.22] and [0.05, 0.65], respectively. Figure 1 shows the 95% confidence bounds of the fitted NBD in the number of observed events (left panel), and the NBD cumulative distribution function (right panel).
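As a numerical check, the sketch below evaluates a negative binomial distribution at the five-year parameter estimates. It assumes the common parameterization $p(n \mid \tau, \nu) = \Gamma(\tau + n) / (\Gamma(\tau)\, n!)\; \nu^{\tau} (1 - \nu)^{n}$, which is consistent with the mean and variance expressions and with the fitted values quoted above; the function names are hypothetical.

```python
import math

def nbd_log_pmf(n, tau, nu):
    """Log-probability of n events under the assumed NBD parameterization."""
    return (math.lgamma(tau + n) - math.lgamma(tau) - math.lgamma(n + 1)
            + tau * math.log(nu) + n * math.log1p(-nu))

def nbd_mean(tau, nu):
    return tau * (1.0 - nu) / nu

def nbd_variance(tau, nu):
    return tau * (1.0 - nu) / nu ** 2

# Five-year maximum likelihood estimates quoted in the text.
tau, nu = 6.49, 0.43

# The probabilities should sum to one; the tail beyond n = 400 is negligible here.
total = sum(math.exp(nbd_log_pmf(n, tau, nu)) for n in range(400))

print(total, nbd_mean(tau, nu), nbd_variance(tau, nu))
```

With these rounded estimates the implied mean is close to the reported sample mean of 8.55, and the implied variance exceeds the mean, as expected for an overdispersed distribution.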

Because the NBD is characterized by two parameters while the Poisson has only one, we used the Akaike Information Criterion (AIC) [Akaike 1974] to compare the fits:

$$ \mathrm{AIC} = 2p - 2\log(L) \tag{6} $$

where $L$ is the likelihood of the data given the fitted distribution and $p$ is the number of free parameters. For the five-year and ten-year intervals, the NBD fit the data better than the Poisson distribution, despite the penalty for the extra parameter: for the five-year intervals, $\mathrm{AIC}_{\mathrm{NBD}} = 117.32$ and $\mathrm{AIC}_{\mathrm{Poi}} = 126.20$, while for the ten-year intervals, $\mathrm{AIC}_{\mathrm{NBD}} = 70.05$ and $\mathrm{AIC}_{\mathrm{Poi}} = 77.56$. To test the robustness of the better fit of the NBD over the Poisson distribution, we also checked the distribution of the number of events in one-year, two-year and three-year intervals of both catalogs. In all cases, the NBD fit better than the Poisson distribution, despite the penalty for an extra parameter.
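The AIC comparison can be reproduced from the quoted values. The sketch below defines the criterion, recovers the implied log-likelihoods, and computes the AIC differences together with the relative likelihood exp(−Δ/2) of the Poisson fit, a standard way to read AIC differences that is our addition and not part of the original analysis.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2p - 2 log(L)."""
    return 2 * n_params - 2 * log_likelihood

def log_likelihood_from_aic(aic_value, n_params):
    """Invert the AIC definition to recover log(L)."""
    return n_params - aic_value / 2.0

# AIC values quoted in the text (NBD: two parameters, Poisson: one parameter).
quoted = {
    "5yr": {"NBD": 117.32, "Poisson": 126.20},
    "10yr": {"NBD": 70.05, "Poisson": 77.56},
}

summary = {}
for period, values in quoted.items():
    delta = values["Poisson"] - values["NBD"]  # positive when the NBD is preferred
    summary[period] = {
        "delta_aic": delta,
        "relative_likelihood_poisson": math.exp(-delta / 2.0),
        "log_l_nbd": log_likelihood_from_aic(values["NBD"], 2),
        "log_l_poisson": log_likelihood_from_aic(values["Poisson"], 1),
    }

print(summary)
```

In both cases the difference exceeds 7, so the Poisson fit is far less plausible than the NBD fit under this reading, even after the penalty for the second parameter.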



4.4 Implications for the CSEP-Italy Experiment 

Several previous studies showed that the distribution of the number of earthquakes in any finite time period is not well approximated by a Poisson distribution and is better fit by an NBD [Kagan 1973; Jackson and Kagan 1999; Schorlemmer et al. 2010b; Werner et al. 2010a] or a heavy-tailed distribution [Saichev and Sornette 2006]. The implications for the CSEP-Italy experiment, and indeed for all CSEP experiments to date, are important.

The only time-independent point process is the Poisson process [Daley and Vere-Jones 2003]. Therefore, a non-Poissonian distribution of the number of earthquakes in a finite time period implies that, if a point process can model earthquakes well, this process must be time-dependent (although there might be other, non-point-process classes of models that are time-independent and generate non-Poissonian distributions). Consequently, the Poisson point process representation is inadequate, even on five- or ten-year timescales for large m > 4.95 earthquakes in Italy, because the rate variability of time-independent Poisson forecasts is too small, and they will fail more often than expected. As a result, the agreement of CSEP-Italy participants to use a Poisson distribution should be viewed as problematic for time-independent models because no Poisson distribution that their model could produce will ever pass the tests with the expected 95% confidence rate. On the other hand, time-dependent models with varying rate might generate an NBD over a longer time period (the empirical NBD can even be used as a constraint on the model).

Solutions to this problem have been discussed previously. The traditional approach has been to filter the data via declustering (deletion) of so-called aftershocks (as done, for instance, in the RELM mainshock experiment [Field 2007; Schorlemmer et al. 2007]). However, the term "aftershock" is model-dependent and can only be applied retrospectively. A more objective approach is to forecast all earthquakes, allowing for time-dependence and non-Poissonian variability. In theory, each model could predict its own distribution for each space-time-magnitude bin [Werner and Sornette 2008], and future predictability experiments should consider allowing modelers to provide such a comprehensive forecast (see also section 7).

A third, ad-hoc solution [see Werner et al. 2010a] is more practical for time-independent models in the current context. Based on an empirical estimate of the observed variability of past earthquake numbers, one can reinterpret the original Poisson forecasts of time-independent models to create forecasts that are characterized by an NBD. One can perform all tests (defined below in section 5) using the original Poisson forecasts, and repeat the tests with so-called NBD forecasts.

We created NBD forecasts for the total number of observed events by using each forecast's mean and an imposed variance identical for all models, which we estimated either directly from the CPTI catalog or from extrapolation assuming that the observed numbers of events are uncorrelated. Appendix A describes the process in detail. Because the resulting NBD forecasts are tested on the same data that were used to estimate the variance, one should expect that the NBD forecasts perform well. The broader NBD results in less specificity, but also fewer unforeseen observations. We will re-examine this ad-hoc solution in the discussion in section 7.
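One way to picture the construction of an NBD forecast from a Poisson forecast is moment matching: keep the forecast's expected number as the mean, impose a variance, and solve the mean and variance expressions of the NBD for its two parameters. The sketch below illustrates only this idea under the parameterization with mean τ(1−ν)/ν and variance τ(1−ν)/ν²; it is not necessarily the procedure of Appendix A, and the forecast mean used here is hypothetical.

```python
def nbd_from_moments(mean, variance):
    """Solve mean = tau*(1-nu)/nu and variance = tau*(1-nu)/nu**2 for (tau, nu)."""
    if variance <= mean:
        raise ValueError("an NBD requires a variance larger than the mean")
    nu = mean / variance
    tau = mean * nu / (1.0 - nu)  # equivalently mean**2 / (variance - mean)
    return tau, nu

forecast_mean = 6.0       # hypothetical expected number of events from one forecast
imposed_variance = 23.73  # five-year sample variance of the CPTI catalog (from the text)

tau, nu = nbd_from_moments(forecast_mean, imposed_variance)
print(tau, nu)
```

The resulting distribution has the same expected number as the original forecast but much wider confidence bounds.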



5 Tests 

To follow the agreed-upon rules of the prospective CSEP-Italy experiment, we used the statistical tests proposed for the RELM experiment and more recent ones that have been implemented within CSEP [Schorlemmer et al. 2007, 2010b; Zechar et al. 2010]. These include: (i) the N(umber)-test, based on the consistency between the total number of observed and expected earthquakes; (ii) the L(ikelihood)-test, based on the consistency between the observed and expected joint log-likelihood score of the forecast; (iii) the S(pace)-test, based on the consistency between the observed and expected joint log-likelihood score of the spatial distribution of earthquakes; and (iv) the M(agnitude)-test, based on the consistency between the observed and expected joint log-likelihood score of the magnitude distribution of earthquakes.
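As an illustration of the N-test idea, the sketch below computes two tail probabilities of the observed total under a Poisson forecast: the probability of at least as many events as observed, and the probability of at most as many. The rejection rule (either probability below α/2) is one common two-sided convention and is an assumption here, as are the function names and the example numbers.

```python
import math

def poisson_cdf(n, lam):
    """P(N <= n) for a Poisson variable with rate lam."""
    if n < 0:
        return 0.0
    term = math.exp(-lam)  # P(N = 0)
    total = term
    for k in range(1, n + 1):
        term *= lam / k    # P(N = k) from P(N = k - 1)
        total += term
    return min(total, 1.0)

def n_test(n_obs, n_fore):
    """Return (delta1, delta2) = (P(N >= n_obs), P(N <= n_obs)) under the forecast."""
    delta1 = 1.0 - poisson_cdf(n_obs - 1, n_fore)
    delta2 = poisson_cdf(n_obs, n_fore)
    return delta1, delta2

def is_rejected(n_obs, n_fore, alpha=0.05):
    """Two-sided decision: reject if the observation lies in either tail."""
    delta1, delta2 = n_test(n_obs, n_fore)
    return delta1 < alpha / 2 or delta2 < alpha / 2

n_fore = 8.55  # hypothetical forecast total, set to the five-year CPTI sample mean
results = {n_obs: is_rejected(n_obs, n_fore) for n_obs in (2, 8, 20)}
print(results)
```

A small first probability signals that the forecast underpredicts the number of events, while a small second probability signals overprediction.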

The L-test proposed by Schorlemmer et al. [2007] is a relatively uninformative test, because the expected likelihood score is influenced by both the entropy of a model and the expected number of earthquakes. As the expected number increases, the expected likelihood score decreases. Therefore, a model that overpredicts the number of earthquakes will tend to underpredict the likelihood score. Because the L-test is one-sided, i.e., a forecast is not rejected if the observed likelihood score is underpredicted, models that overpredict the number of earthquakes might pass the L-test trivially [for a concrete example, see Zechar et al. 2010, p. 1190-1191]. As a remedy, we additionally used a conditional L-test [Werner et al. 2010a], in which the observed likelihood score is compared with expected likelihood scores conditional on the number of observed quakes. In contrast to the S or M-tests, the conditional L-test assesses the joint space-magnitude forecast, but it overcomes the sensitive dependence of the expected likelihood scores on the number of expected events.
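The conditional L-test can be sketched by simulation: draw catalogs that contain exactly the observed number of events, place each event in a bin with probability proportional to the forecast rate, and compare the simulated joint log-likelihood scores with the observed one. The code below is a minimal illustration of that idea with hypothetical rates and counts; it assumes strictly positive rates and is not the CSEP implementation.

```python
import math
import random

def poisson_joint_log_likelihood(counts, rates):
    """Joint Poisson log-likelihood of bin counts, assuming independent bins."""
    return sum(n * math.log(lam) - lam - math.lgamma(n + 1)
               for n, lam in zip(counts, rates))

def conditional_l_test(observed, rates, n_sim=2000, seed=0):
    """Fraction of simulated catalogs (same total count) scoring <= the observation."""
    rng = random.Random(seed)
    n_obs = sum(observed)
    bins = range(len(rates))
    l_obs = poisson_joint_log_likelihood(observed, rates)
    at_or_below = 0
    for _ in range(n_sim):
        counts = [0] * len(rates)
        # Place each simulated event in a bin with probability proportional to its rate.
        for b in rng.choices(bins, weights=rates, k=n_obs):
            counts[b] += 1
        if poisson_joint_log_likelihood(counts, rates) <= l_obs:
            at_or_below += 1
    return at_or_below / n_sim

rates = [0.5, 0.3, 0.15, 0.05]  # hypothetical expected counts in four bins
gamma_typical = conditional_l_test([3, 1, 1, 0], rates)   # events where rates are high
gamma_unlikely = conditional_l_test([0, 0, 0, 5], rates)  # all events in the lowest-rate bin
print(gamma_typical, gamma_unlikely)
```

A very small fraction means that the observed score is lower than almost all scores the forecast itself would produce for the same number of events, which indicates an inconsistent space-magnitude distribution.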



6 Results 

6.1 Testing Five-Year Forecasts on the CSI Catalog

In Figure 2, we show the results of the N, L, S, and M-tests applied to the time-independent forecasts for the most recent five-year target period (1998-2002) of the CSI catalog. We discuss each of the test results below. As a summary of all the results we present here and below, Tables 2 and 3 list all the tests that the forecasts fail for each of the considered target periods of the CSI and CPTI catalogs, respectively.

6.1.1 N-Test Results

The N-test results in Figure 2a show that only one forecast (Nanjo-et-al.RI) can be rejected assuming Poisson confidence bounds because significantly more earthquakes were observed than expected. Using NBD uncertainties, none of the forecasts can be rejected, because the confidence bounds are wider (typically by several earthquakes on both sides).

6.1.2 L-Test Results

In Figure 2b, we show the results of the unconditional and the conditional L-tests applied to the original (Poisson) forecasts. We did not try to apply NBD uncertainty to the rate forecasts in each space-magnitude bin, and therefore did not simulate likelihood values based on an NBD forecast.

Only one forecast fails the unconditional L-test, while four fail the conditional L-test. The confidence bounds of the unconditional L-test are much larger because the number of simulated earthquakes is allowed to vary, thereby increasing the spread of the simulated likelihood scores. The impact of the expected number of earthquakes on the expected unconditional likelihood score is particularly visible for the forecasts Meletti-et-al.MPS04 and Nanjo-et-al.RI. The forecast Meletti-et-al.MPS04 expects more earthquakes than were observed during this period (although not significantly more) and therefore also expects a likelihood score that is lower than observed. Moreover, the additional variability due to the increased number of events broadens the confidence bounds and the model thus passes the L-test. However, the forecast fails the conditional L-test, because, given the number of observed earthquakes, the observed likelihood score is too small to be consistent with the forecast. Meanwhile, the forecast Nanjo-et-al.RI underpredicts the number of quakes (assuming Poisson variability) and therefore overpredicts the likelihood score and fails the unconditional L-test. However, conditional on the number of observed earthquakes, the observed likelihood score is consistent with the forecast.

To summarize, the conditional L-test reveals information that is separate from the N-test results and presents a stricter evaluation of the forecasts. In the remainder of this article, we will only consider the more informative conditional L-test results. From the results of the 1998-2002 target period, we can conclude that the joint distribution of the locations and magnitudes of the observed earthquakes is inconsistent with the group of ALM forecasts and the forecast Meletti-et-al.MPS04.

6.1.3 Reference Forecast From a "Model of Most Information"

To quantify the ability of the present time-independent forecasts to accurately predict the locations and magnitudes of the observed earthquakes, one can calculate the likelihood score of an ideal earthquake forecast (what might be called a successful prediction of the observed earthquakes, naturally with the benefit of hindsight, or a forecast from a "model of most information", as opposed to the "model of least information" [Evison 1999] discussed next). For instance, working within the constraints of a Poisson distribution of events in each bin, one could calculate the likelihood score of a forecast that assigns an expected rate in each space-magnitude bin that is equal to the number of observed shocks within that bin. If at most one earthquake occurs per bin, the observed log-likelihood score of such a perfect forecast is the negative number of observed events. The score is only slightly smaller if more than one event occurs in a given bin. In Figure 2b, the observed likelihood scores of the forecasts are evidently "far" from the score of a perfect forecast, which would be roughly equal to −10. The typical scores of the forecasts lie in the region of −100, which implies that the likelihood of the data under the perfect forecast is about



10 more likely than under a typical CSEP-Italy forecast. The information gain per earthquake Harte 



and Vere- Jones 



2005 



of the perfect forecast over a typical forecast is on the order of 10*. 
These numbers help quantify the difference between a perfect "prediction" within the current CSEP 
experiment design and a typical probabilistic earthquake forecast. One might imagine tracking this index 
of earthquake predictability to quantify the progress of the community of earthquake forecasters towards 
better models. However, the primary goal of CSEP's experiments is to test and evaluate hypotheses 
about earthquake occurrence, and the observed degree of predictability is sufficient to carry out this 
endeavor. 
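The arithmetic behind these numbers can be reproduced with a short Python sketch. It assumes that the information gain per earthquake is taken as exp((L_A - L_B)/N); the function names and the toy bin layout are hypothetical, and the value -100 is simply the representative score quoted above.

```python
import math

def perfect_forecast_loglik(counts):
    """Poisson log-likelihood when each bin's forecast rate equals its observed count."""
    total = 0.0
    for n in counts:
        if n > 0:  # empty bins have rate 0 and count 0, contributing log(1) = 0
            total += n * math.log(n) - n - math.lgamma(n + 1)
    return total

def information_gain_per_event(loglik_a, loglik_b, n_events):
    """Gain per earthquake of forecast A over forecast B (assumed convention)."""
    return math.exp((loglik_a - loglik_b) / n_events)

counts = [1] * 10 + [0] * 90                  # ten events, at most one per bin
perfect = perfect_forecast_loglik(counts)     # equals -10
typical = -100.0                              # representative score from the text
log10_ratio = (perfect - typical) / math.log(10)          # about 39
gain = information_gain_per_event(perfect, typical, 10)   # e**9, about 8.1e3

# A bin holding two events contributes log(2) - 2, about -1.31,
# compared with -2 when the two events fall in separate bins.
doubly_occupied = perfect_forecast_loglik([2])
```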



6.1.4 Reference Forecast From a "Model of Least Information" 



One could equally construct a forecast from a "model of least information" [Evison 1999], often called the null hypothesis, which might be based on a uniform spatial distribution, a total expected rate equal to the observed mean over a period prior to the target period, and a Gutenberg-Richter magnitude distribution with b-value equal to one. Because several models already assume that (i) magnitudes are identically and independently distributed according to the Gutenberg-Richter magnitude distribution and (ii) the total expected rate is equal to the mean number of observed shocks, the only real difference between these models and an uninformative forecast lies in the spatial distribution. We therefore included the likelihood score of a spatially uniform forecast only in the S-test results. In Table S1 of the electronic supplement, we additionally provide the information gains per earthquake [Kagan and Knopoff 1977; Harte and Vere-Jones 2005] of each spatial forecast over a spatially uniform forecast for all the considered target periods.
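A forecast of this kind is simple to construct, as the following Python sketch suggests. The magnitude range, bin width, number of cells and total rate are hypothetical placeholders rather than the CSEP-Italy grid, and the truncation of the Gutenberg-Richter law at an upper magnitude is an assumption made to keep the example finite.

```python
def gr_bin_fractions(m_min, m_max, dm, b=1.0):
    """Fraction of events per magnitude bin under a truncated Gutenberg-Richter law."""
    n_bins = round((m_max - m_min) / dm)
    edges = [m_min + i * dm for i in range(n_bins + 1)]
    weights = [10 ** (-b * lo) - 10 ** (-b * hi) for lo, hi in zip(edges, edges[1:])]
    total = sum(weights)
    return [w / total for w in weights]

def uniform_reference_forecast(total_rate, n_cells, mag_fractions):
    """Spread the total expected rate evenly over cells, then over magnitude bins."""
    per_cell = total_rate / n_cells
    return [[per_cell * f for f in mag_fractions] for _ in range(n_cells)]

fractions = gr_bin_fractions(4.95, 8.95, 0.1)   # hypothetical magnitude range
forecast = uniform_reference_forecast(total_rate=10.0, n_cells=500,
                                      mag_fractions=fractions)
```

With b equal to one and a bin width of 0.1, each magnitude bin receives a factor of 10^(-0.1), or about 0.79, of the rate of the bin below it.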
6.1.5 S-Test and M-test Results

The S-test and M-test results, shown in Figures 2c) and d), suggest that the weakness of the group of ALM forecasts and the forecast Meletti-et-al.MPS04 lies in forecasting the spatial distribution of earthquakes: all four forecasts fail the S-test with very small p-values, while all models pass the M-test. Additionally, the forecasts Gulia-Wiemer.HALM, Meletti-et-al.MPS04 and Schorlemmer-Wiemer.ALM obtain scores that are lower than the score of a uniform model of least information.

In the case of the ALM group of forecasts, the low spatial likelihood scores leading to the S-test failures have a common origin. In roughly one half of all spatial bins, the three forecasts expect an extremely small constant number of earthquakes per spatial bin, indicating that a constant background rate was set in these cells. The forecasts Gulia-Wiemer.ALM and Gulia-Wiemer.HALM expect on the order of 10~* earthquakes in each spatial background bin, while the forecast Schorlemmer-Wiemer.ALM expects an even smaller 10~^^ earthquakes per bin. Accordingly, the probability of observing one earthquake in these bins is of the same order of magnitude. However, earthquakes do occur in some of these bins, and their occurrences in such low-probability (background) bins cause very low likelihood scores. Because these losses against a normalized uniform forecast, which expects roughly 10~^ earthquakes per bin to sum to the 10 observed quakes, are not compensated by equal or greater gains from earthquakes in regions where the forecasts are higher, the forecasts obtain extremely small spatial likelihood scores and fail the S-test.
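The balance between gains and losses can be made concrete with a small Python sketch. It assumes that both forecasts are normalized to the same total number of events, so that the Poisson terms that do not depend on the event locations cancel and the score difference reduces to a sum of log rate ratios. All rates below are hypothetical and chosen only for illustration; they are not the rates of any submitted forecast.

```python
import math

def spatial_loglik_gain(forecast_rates, uniform_rate, event_cells):
    """Spatial log-likelihood of a forecast minus that of a uniform forecast,
    assuming both are normalized to the same total rate."""
    return sum(math.log(forecast_rates[c] / uniform_rate) for c in event_cells)

uniform_rate = 1e-3                                   # hypothetical rate per cell
forecast_rates = {"active": 3e-3, "background": 1e-8}  # hypothetical rates per cell
events = ["active"] * 9 + ["background"]

gain_active_only = spatial_loglik_gain(forecast_rates, uniform_rate, ["active"] * 9)
net = spatial_loglik_gain(forecast_rates, uniform_rate, events)
```

In this toy case nine well-forecast events each gain log(3), about 1.1, while the single background event loses log(10^5), about 11.5, so the net score falls below that of the uniform forecast.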

During the 1998-2002 period, the forecasts Gulia-Wiemer.ALM and Gulia-Wiemer.HALM fail the S-test because of one M_L 5.4 earthquake, located offshore north of Sicily at 39.06°N and 15.02°E, which occurred in such a background rate bin. Similarly, the forecast Schorlemmer-Wiemer.ALM fails the S-test because of an M_L 5.1 earthquake at 37.93°N and 17.55°E on the south-eastern boundary of the testing region. Apart from two other events, the remaining seven earthquakes during this target period occurred in cells where the ALM forecasts expected more earthquakes than the uniform forecast. However, the gains achieved for these earthquakes do not compensate for the losses incurred from the event in the background bins.

The distribution of rates of the forecast Meletti-et-al.MPS04 shows the existence of a similar background rate, although it is larger (10~* earthquakes per bin) than the background rates of the ALM forecasts. The occurrence of an earthquake in a background bin can therefore be more easily compensated by gains achieved from other earthquakes. However, during the 1998-2002 period, five earthquakes occurred in such background bins, and the losses could not be masked by the gains. These five earthquakes include all four offshore earthquakes during this period (including the two events that caused the ALM forecasts to fail), along with one additional shock of magnitude M_L 5.3 at 46.697°N and 11.07°E in northern Italy.

6.1.6 Results from Other Five-Year Target Periods of the CSI Catalog

In Figure 3, we show the results of two further, separate five-year target periods from the CSI catalog: 1988-1992 (circles) and 1993-1997 (squares). In combination with Figure 2, this provides insight into the variability of the five-year test results due to natural fluctuations of seismicity.

During 1988-1992, only three target earthquakes occurred. Although this number is small, it falls within the 95% confidence bounds of historical fluctuations (see Figure 1). Six forecasts are rejected by the N-test because they overpredict the number of observed events. These forecasts are: Akinci-et-al.Hazgridx, Chan-et-al.HzaTI, Meletti-et-al.MPS04, Schorlemmer-Wiemer.ALM, Zechar-Jordan.CPTI, and Zechar-Jordan.Hybrid. As results from longer target periods below confirm, this group consistently overpredicts the total rate. The modelers of these six forecasts indicated to us that they calibrated their models on the moment magnitude scale rather than the local magnitude scale used for prospective testing, leading to an overprediction of the number of earthquakes with local magnitude M_L > 4.95. This error in the calibration complicates the interpretation of the N-test results for this group of models.

As before, we observe differences in the results from the NBD and Poisson N-tests. During 1988-1992, the forecast Gulia-Wiemer.HALM is rejected by the N-test assuming Poisson confidence bounds, but the more realistic NBD uncertainties allow the forecast to pass. Similarly, the forecast Nanjo-et-al.RI fails the Poisson N-test but passes the NBD N-test during 1993-1997.

The conditional L-test results indicate that in the case of the forecast Schorlemmer-Wiemer.ALM, the three earthquakes during 1988-1992 suffice to reject the model. Results from the 1993-1997 period again show rejections of the ALM group of forecasts. However, in contrast to the 1998-2002 period, the forecast Meletti-et-al.MPS04 passes the test during both periods. Results from longer target periods, presented below, are necessary to judge this forecast conclusively.

The combined S and M-test results again locate the source of the ALM rejections in the spatial dimension of the forecast. Moreover, Schorlemmer-Wiemer.ALM continues to perform worse than a uniform model during both target periods. During the 1993-1997 target period, the forecasts fail because of an M_L 5.8 earthquake in 1994 at 39.398°N and 15.21°E offshore to the north of Sicily, which occurred in a background bin. The large resulting likelihood loss cannot be compensated by the gains achieved from the other eight observed earthquakes. During the 1988-1992 target period, the forecasts Gulia-Wiemer.ALM and Gulia-Wiemer.HALM pass the S-test, but the forecast Schorlemmer-Wiemer.ALM receives a low likelihood score because of an uncompensated likelihood loss due to an M_L 5.1 earthquake in 1990 that occurred in a low-probability (but not background) bin at 37.33°N and 15.24°E offshore and east of Mount Etna. Additionally, the forecast Zechar-Jordan.CPTI scored marginally less than a uniform forecast, although the score is consistent with the forecast's expectation.

The M-test results thus far, and for all but the longest of the target periods considered below, are not very informative: no rejections occur. The individual model distributions are very similar, indicating that the differences between the predicted magnitude distributions are small. The differences between the observed likelihood scores are equally small.

To summarize, some of the test results vary with the considered five-year target period, while others are robust. Schorlemmer-Wiemer.ALM consistently shows poor performance in the spatial forecast, while the other two ALM forecasts are rejected in two of three target periods. Meletti-et-al.MPS04 fails the conditional L and S tests during one of three five-year target periods.

6.2 Testing Ten-Year Forecasts on the CPTI Catalog

In Figure 4, we summarize the results of the N, conditional L, S and M-tests for the time-independent models and five non-overlapping ten-year target periods of the CPTI catalog. These results mimic the prospective ten-year experiment and help gauge the variability of the results. The online material that accompanies this article (available at http://mercalli.ethz.ch/~mMeriier/CSEP_ITALY/ESUPP/) provides additional figures of the forecasts, maps of their likelihood ratios against a uniform forecast, and concentration diagrams [Rong and Jackson 2002; Kagan 2009] for the entire CPTI data set from 1901 until 2006. Because the figures are based on the longest target period, which we consider explicitly in section 6.3, they include all earthquakes observed during the ten-year target periods and provide an informative visual presentation of the results.
6.2.1 N-Test Results

The N-test results are shown in panel a) of Figure 4. The numbers of observed shocks during the five ten-year periods were 15, 18, 13, 8 and 23. For the remainder of this article, we do not discuss the N-test results from the group of models that were wrongly calibrated on the moment magnitude scale (see section 6.1.6). Of the remaining six forecasts, none could forecast all five observations within the 95% confidence bounds of the Poisson distribution. Five forecasts (Gulia-Wiemer.ALM, Gulia-Wiemer.HALM, Werner-et-al.CSI, Werner-et-al.Hybrid and Zechar-Jordan.CSI) are rejected only during one of the five periods when assuming Poisson confidence bounds and cannot be rejected at all when considering confidence bounds based on an NBD.

The forecast Nanjo-et-al.RI expects far fewer shocks than the other forecasts and consistently underpredicts the number of earthquakes. Assuming the original Poisson variability in the number of shocks, the forecast is rejected during four of the five target periods. However, the forecast cannot be rejected at all if NBD confidence bounds are used.
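The sensitivity of the N-test to the assumed rate distribution can be explored with the following Python sketch. It assumes a two-sided formulation based on the tail probabilities P(N >= n_obs) and P(N <= n_obs), and it parameterizes the NBD by its mean and variance. The forecast mean, the variance and the function names are hypothetical; only the observed count of 23 is taken from the text.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k events given the expected number."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def nbd_pmf(k, mean, variance):
    """Negative binomial probability of k events (requires variance > mean)."""
    r = mean ** 2 / (variance - mean)
    p = mean / variance
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def n_test(n_obs, pmf):
    """Return (P(N >= n_obs), P(N <= n_obs)) under the given distribution."""
    p_le = sum(pmf(k) for k in range(n_obs + 1))
    p_lt = p_le - pmf(n_obs)
    return 1.0 - p_lt, p_le

mean, variance, n_obs = 9.0, 40.0, 23   # hypothetical forecast mean and variance
poisson_tails = n_test(n_obs, lambda k: poisson_pmf(k, mean))
nbd_tails = n_test(n_obs, lambda k: nbd_pmf(k, mean, variance))
```

For the same expected number of events, the upper tail probability under the NBD is much larger than under the Poisson distribution, which is why an underpredicting forecast can be rejected with Poisson bounds and yet survive with NBD bounds.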

6.2.2 Conditional L-Test Results 

The conditional L-test results are displayed in panels b) to f) of Figure 4. The only robust result is the continued failure of the forecast Schorlemmer-Wiemer.ALM. The forecasts Gulia-Wiemer.ALM and Gulia-Wiemer.HALM fail the test during two periods, while Nanjo-et-al.RI and Werner-et-al.CSI are both rejected during 1967-1976. Reasons for these rejections are discussed in the context of the S and M-test results below.

The forecast Meletti-et-al.MPS04 obtains an observed joint log-likelihood score of negative infinity during the target period 1967-1976. This score results from the fact that one earthquake occurred in a space-magnitude bin in which the forecasted rate was zero. A zero forecast is equivalent to saying that target earthquakes are impossible in this bin, and if an event does occur in this bin, the forecast is automatically rejected. The earthquake in question, the 1968 Belice earthquake, occurred on 15 January 1968 in western Sicily at 37.76°N and 12.98°E with a magnitude M_L = 6.39 and caused several hundred fatalities. According to the forecast, however, earthquakes larger than M_L = 6.25 are impossible in this spatial bin because the forecasted rates in the magnitude bins are non-zero only for magnitudes up to M_L = 6.25. The forecast's rejection implies that the maximum magnitude set for this location was too small; the discrepancy might be due to the wrong magnitude calibration reported above and/or indicate that the maximum magnitude may require a modification in this area. (The forecast Meletti-et-al.MPS04 does not fail the S-test because the forecast in this particular spatial cell is non-zero when summed over the individual magnitude bins.)
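The difference between the joint score and the spatial score of a single cell can be seen in a minimal Python sketch. The rates and the three-bin magnitude layout are hypothetical, and the normalization applied in the actual S-test is omitted for brevity.

```python
import math

def log_poisson(n, lam):
    """Poisson log-probability of n events given rate lam, allowing lam = 0."""
    if lam == 0.0:
        return 0.0 if n == 0 else float("-inf")
    return n * math.log(lam) - lam - math.lgamma(n + 1)

# One spatial cell with three magnitude bins (hypothetical rates).
# The last bin lies above the forecast's maximum magnitude, so its rate is zero.
cell_rates = [0.02, 0.005, 0.0]
cell_counts = [0, 0, 1]          # one event in the "impossible" bin

joint_score = sum(log_poisson(n, lam) for n, lam in zip(cell_counts, cell_rates))
spatial_score = log_poisson(sum(cell_counts), sum(cell_rates))
```

Here the joint score is negative infinity, whereas the spatial score, which uses the rate summed over the magnitude bins, remains finite.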

6.2.3 S-Test and M-test Results

In panels g) through k) of Figure 4, the S-test results are shown. Five (spatial) forecasts cannot be rejected by the S-test during any of the five target periods: Akinci-et-al.Hazgridx, Chan-et-al.HzaTI, Werner-et-al.Hybrid, Zechar-Jordan.CPTI and Zechar-Jordan.Hybrid.

The two forecasts Werner-et-al.CSI and Zechar-Jordan.CSI, which were optimized on the CSI catalog, both fare well during the target periods that are also or at least partially covered by the CSI catalog, i.e. from 1981 onwards. However, the two forecasts are rejected during the two earliest target periods, which can be considered as out-of-sample tests for these two forecasts. During the 1957-1966 period, the forecasts fail to predict several diffuse earthquakes in northern Italy and two offshore earthquakes between the Ligurian coast and Corsica. The 1967-1976 period contains the 1968 western Sicily earthquake sequence (including the above-mentioned M_L = 6.39 Belice earthquake), which occurs in spatial cells with low expected rates. Evidently, the CSI catalog contains little seismicity in these regions from which the models could have anticipated the occurrence of these earthquakes.

Interestingly, the forecast Nanjo-et-al.RI, which was also calibrated on CSI data, only fails during the 1967-1976 period (again due to the western Sicily sequence in 1968) but passes during the 1957-1966 interval. The model employs a relatively coarse grid to forecast earthquakes (see Figure S6 of the electronic supplement), and this characteristic helped forecast the offshore quakes north of Corsica better than the Werner-et-al.CSI and Zechar-Jordan.CSI forecasts.

The three ALM-based forecasts continue to forecast poorly the spatial distribution of observed earthquakes. During the 1957-1966 target period, the two above-mentioned earthquakes north of Corsica and a shock in northern Italy occur in background bins of all three ALM forecasts, leading to their S-test failures. During the 1967-1976 target period, the Gulia-Wiemer.ALM and Gulia-Wiemer.HALM forecasts fail because of three earthquakes in background bins: two shocks occurred as part of the 1968 western Sicily earthquake sequence and one in central Italy at 44.81°N and 10.35°E. While none of these events (nor any others) occur in background bins of the forecast Schorlemmer-Wiemer.ALM during this period, two earthquakes of the 1968 western Sicily sequence, as well as an earthquake at 41.65°N and 15.73°E, do incur unexpectedly low likelihood scores, resulting in the S-test rejection. In fact, Schorlemmer-Wiemer.ALM fails all considered ten-year target periods. Whenever the spatial likelihood score falls below that of a uniform forecast, at least one earthquake occurred in a so-called background cell.

The forecast Meletti-et-al.MPS04 is rejected twice by the S-test. During the period 1957-1966, the forecast fails because of the two recurring offshore earthquakes north of Corsica in July 1963 and because of two earthquakes in northeastern Italy, all of which occurred in background bins. During 1987-1996, three earthquakes occurred in background bins: (i) an offshore earthquake on April 26, 1988, at 42.21°N and 16.66°E; (ii) an M_L = 5.43 aftershock of the Potenza, southern Italy, earthquake of May 5, 1990; and (iii) an M_L = 5.54 offshore earthquake on December 13, 1990, east of Mount Etna in the Sea of Sicily.

6.3 Test Results from Longer Periods

The long-term forecasts submitted for the CSEP-Italy experiment were calculated for five-year and ten-year periods. Because the forecasts are time-independent and characterized by Poisson uncertainty, one can test suitably scaled versions of the forecasts on longer time periods: 18 years (the duration of the reliable part of the entire CSI catalog, from 1985 through 2002), 57 years (the duration of the most reliable part of the CPTI catalog, from 1950 through 2006), and 106 years (the entire CPTI catalog). In this section, we present the results of testing these scaled forecasts. The online material presents further figures of the forecasts, likelihood ratios and concentration diagrams based on the 106-year target period.
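Because the expected number of events of a time-independent Poisson forecast grows linearly with the duration of the target period, the scaling amounts to multiplying every bin rate by the ratio of the two durations. A minimal Python sketch, with a hypothetical function name and toy rates, reads:

```python
def scale_forecast(rates, forecast_years, target_years):
    """Rescale time-independent Poisson bin rates to a different target duration."""
    factor = target_years / forecast_years
    return [lam * factor for lam in rates]

five_year_rates = [0.5, 1.0, 0.25]                    # toy rates per bin
ten_year_rates = scale_forecast(five_year_rates, 5, 10)
eighteen_year_rates = scale_forecast(five_year_rates, 5, 18)
```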

The test results of the 18-year period from 1985 to 2002 of the CSI catalog are shown in Figure 5. Twenty-three earthquakes occurred during this period. The N-test results reveal the same features already observed previously: a group of models overpredicts the number of earthquakes (Akinci-et-al.Hazgridx, Chan-et-al.HzaTI, Meletti-et-al.MPS04, Schorlemmer-Wiemer.ALM, Zechar-Jordan.CPTI, Zechar-Jordan.Hybrid). While the confidence bounds of the negative binomial distribution remain substantially wider than the bounds based on the Poisson distribution, there are only two forecasts for which the test results are ambiguous (Akinci-et-al.Hazgridx and Nanjo-et-al.RI). The ALM forecasts and the Meletti-et-al.MPS04 forecast fail the conditional L-test and the S-test, with Schorlemmer-Wiemer.ALM scoring less than a uniform spatial forecast. The failures are due to the earthquakes we discussed previously that occur either in background bins or in locations with low expected rates.

Increasing the duration of the retrospective tests to the 57 most recent years of the CPTI catalog (1950-2006) yields 83 earthquakes and leads to similar results but with greater statistical significance (Figure 6). In addition to the rejections mentioned in the preceding paragraph, the N-test now unequivocally rejects the forecasts Akinci-et-al.Hazgridx and Nanjo-et-al.RI, even when the confidence bounds of an NBD are considered. The conditional L-test rejects the forecast Meletti-et-al.MPS04 because of a likelihood score of negative infinity (discussed in section 6.2.2). The S-test results show that the forecast Nanjo-et-al.RI can be rejected in addition to the ALM forecasts and Meletti-et-al.MPS04. No forecasts can be rejected by the M-test, despite 57 years of data.

The longest period over which we evaluated the scaled forecasts was 106 years, spanning the full duration of the CPTI08 catalog and containing 183 earthquakes (Figure 7; see the online material for maps of the forecasts, likelihood ratios and concentration diagrams). The N-test results now show a clear separation between the group of forecasts that consistently overpredict, the forecast Nanjo-et-al.RI, which underpredicts, and the forecasts that cannot be rejected by assuming confidence bounds based on either a Poisson or a negative binomial distribution. Application of the conditional L-test additionally rejects the forecasts Nanjo-et-al.RI and Werner-et-al.CSI, while the S-test now also fails Werner-et-al.CSI and Zechar-Jordan.CSI.

Interestingly, four forecasts fail the M-test: Akinci-et-al.Hazgridx, Chan-et-al.HzaTI, Meletti-et-al.MPS04 and Nanjo-et-al.RI. In Figure 8, we compare the observed magnitude distribution with the predicted magnitude distributions. For reference, we added a pure Gutenberg-Richter (GR) distribution with b-value equal to one, which passes the M-test. The magnitude distributions predicted by Akinci-et-al.Hazgridx and Nanjo-et-al.RI are close to exponential, but with b-values larger than one. As a result, large earthquakes are less likely, and the forecasts are penalized for the occurrence of three m > 7 earthquakes. The magnitude distribution of the forecast Chan-et-al.HzaTI seems to reflect its non-parametric kernel estimation method (see section 2) and also underpredicts the rate of large shocks. Finally, the magnitude distribution of Meletti-et-al.MPS04 is non-monotonic: several characteristic magnitude bulges can be seen. However, the largest events occur between the bulges, for which the forecast is penalized.
7 Discussion and Conclusions



7.1 The Role of the Poisson Distribution in the Forecast Specification 



The assumption of Poisson rate variability in the CSEP-Italy experiments (as well as other CSEP experiments, including RELM [Field 2007; Schorlemmer et al. 2007]) has certain advantages. In particular, this is a simplifying assumption: because the Poisson distribution is defined by a single parameter, the forecasts do not require a complete probability distribution in each bin. Moreover, Poisson variability has often been used as a reference model against which to compare time-varying forecasts, and it yields an intuitive understanding.

Despite these advantages, however, this assumption is questionable, and the method of forcing each forecast to be characterized by the same uncertainty is not the only solution [see also the discussion by Zechar et al. 2010]. Werner and Sornette [2008] remarked that most forecast models generate their own likelihood distribution, and this distribution depends on the particular assumptions of the model; moreover, there is no reason to force every model to use the same form of likelihood distribution. The effect of this forcing is likely stronger for time-dependent, e.g. daily, forecasts [Lombardi and Marzocchi 2010], and it is difficult to judge (without the help of modelers) the quality of approximating each model-dependent distribution by a Poisson distribution. On the other hand, one can check whether or not the Poisson assumption is appropriate with respect to observations. In section 4.3, we showed that the target earthquake rate distribution is approximated better by an NBD than by a Poisson distribution. Therefore, time-independent forecasts that predict a Poisson rate variability necessarily fail more often than expected at 95% confidence because the observed distribution differs from the model distribution. To improve time-independent forecasts, the (non-Poissonian and potentially negative binomial) marginal rate distribution over long timescales needs to be estimated. However, the parameter values of the rate NBD change as a function of the temporal and spatial boundaries of the study region over available observational periods [Kagan 2010]. Whether a stable asymptotic limit exists (loosely speaking, whether seismic rates are stationary) remains an open question. For time-dependent models, on the other hand, several classes exist which are capable of producing a rate NBD over finite time periods, including branching processes [Kagan 2010] and Poisson processes with a stochastic rate parameter distributed according to the Gamma distribution.
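The last of these statements follows from a standard result, sketched here for reference. The shape parameter $k$ and scale parameter $\theta$ of the Gamma distribution are notation introduced for this illustration and do not appear elsewhere in the article.

```latex
P(N = n)
  = \int_0^\infty \frac{\lambda^{n} e^{-\lambda}}{n!}\,
    \frac{\lambda^{k-1} e^{-\lambda/\theta}}{\Gamma(k)\,\theta^{k}}\, d\lambda
  = \frac{\Gamma(n+k)}{n!\,\Gamma(k)}
    \left(\frac{1}{1+\theta}\right)^{k}
    \left(\frac{\theta}{1+\theta}\right)^{n},
\qquad
\mathrm{E}[N] = k\theta, \quad
\mathrm{Var}[N] = k\theta\,(1+\theta).
```

The resulting distribution is negative binomial, and its variance exceeds its mean by the factor $1+\theta$, so that the Poisson case is recovered only in the limit $\theta \to 0$ at fixed $k\theta$.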

Despite this criticism, it is unlikely that the Poisson distribution would be replaced by a model-dependent distribution that is substantially different, particularly for long-term models. Therefore, the p-values of the test statistics used in the N, L, S and M-tests might be biased towards lower values, but they do provide rough estimates. Nevertheless, one should bear in mind that a quantile score that is outside the 95% confidence bounds of the Poisson distribution may be within the acceptable range if a model-dependent distribution were used. As an illustration, and to explore the effect of the Poisson assumption in these experiments, we created a set of modified forecasts with rate variability estimated from the observed history. The width of the 95% confidence interval of the total rate forecast increased, in certain cases substantially. Several forecasts are rejected if a Poisson variability is assumed, while they pass the test under the assumption of an NBD. Overall, however, the p-values (quantile scores) of the test statistics based on the Poisson approximation often give good approximate values. Only in borderline cases did the Poisson assumption lead to (potentially) false rejections of forecasts.

The modified forecasts based on an NBD are not an entirely satisfactory solution to the problem. 
First, the model distribution in each bin should arise naturally from a model's hypotheses, rather than 
an empirical adjustment made by those evaluating the forecast. Second, even if a negative binomial 
distribution adequately represents the distribution of the total number of observed events in an entire 
testing region, one should specify the parameter values for each bin to make the non-Poisson forecasts 
amenable also to the L-, S and M-tests. Therefore, future experiments should allow forecasts that are 
not characterized by Poisson rate uncertainty. 

More generally, future experiments might consider other forecast formats and additional model classes. For example, stochastic point process models provide a continuous likelihood function which can characterize conditional dependence in time, magnitude and space (and focal mechanisms, etc.). As a result, full likelihood-based inference for point processes and tools for model diagnostics are applicable to this class of models [e.g. Ogata 1999; Daley and Vere-Jones 2003; Schoenberg 2003]. However, when considering new classes of forecasts, one may keep in mind that a major success of the RELM and CSEP experiments was the homogenization of forecast formats to facilitate comparative testing.

7.2 Performance and Utility of the Tests 

We explored results from the N, L, S and M-tests in this study because they are the "staple" CSEP tests. Other metrics for evaluating forecasts should certainly be considered, especially with regard to alarm-based tests [e.g. Molchan and Keilis-Borok 2008; Zechar and Jordan 2008] and further conditional likelihood tests [Zechar et al. 2010]. Overall, the N, L, S and M-tests are intuitive and relatively easy to interpret. However, we demonstrated a weakness in the L-test and replaced it with a conditional L-test that better assesses the quality of a forecast [see also Werner et al. 2010a]. Among the metrics, the S-test results were the most helpful in tracking down the weak features of forecasts, because the biggest differences between time-independent models lie in their spatial forecasts.

The M-test results were largely uninformative. Because the magnitude distributions considered here were so similar, this result is not surprising; indeed, it is in accordance with the statistical power exploration of Zechar et al. [2010]. No forecast could be rejected for target periods ranging from 5 to 57 years. Different tests, such as a traditional Kolmogorov-Smirnov (KS) test, should be compared with the current likelihood-based M-test, particularly in terms of statistical power.
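As a point of reference for such a comparison, the one-sample KS statistic against a Gutenberg-Richter distribution can be computed as in the following Python sketch. It assumes continuous (unbinned) magnitudes and an untruncated exponential magnitude law; the function names, the threshold magnitude and the synthetic sample are hypothetical. Magnitudes binned to 0.1 units, as in the CSEP forecasts, would require a discrete variant of the test.

```python
import math

def gr_cdf(m, m_min, b=1.0):
    """CDF of an untruncated Gutenberg-Richter (exponential) magnitude law."""
    return 1.0 - 10 ** (-b * (m - m_min))

def ks_statistic(magnitudes, cdf):
    """Largest distance between the empirical CDF and the model CDF."""
    xs = sorted(magnitudes)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

m_min = 4.95                      # hypothetical threshold magnitude
n = 10
# Synthetic sample placed at the mid-quantiles of the model, so the statistic is 0.5/n.
sample = [m_min - math.log10(1.0 - (i + 0.5) / n) for i in range(n)]
d_stat = ks_statistic(sample, lambda m: gr_cdf(m, m_min))
```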

The current status quo in CSEP experiments is to reject a forecast if it fails a single test at 95% confidence. As we discussed above, the actual p-values provide a more meaningful assessment than a simple binary yes/no statement because the assumed confidence bounds may not accurately represent the model uncertainty. Furthermore, as the suite of tests grows, we should be concerned with joint confidence bounds of the ensemble of tests, rather than the individual significance levels of each test. Joint confidence bounds can be obtained from model simulations. A global confidence bound for the multiple tests can then be established. A similar question will arise when forecasts from the same model are tested within nested regions, as will be the case when comparing the performance of a model's forecast for Italy with that for the entire globe.

Finally, future experiments may consider developing tests that address particular characteristics of a forecast [see also the discussion by Zechar et al. 2010]. For example, a forecast might be a reflection of the hypothesis that the magnitude distribution varies as a function of tectonic setting. In this context, an M-test conditioned on the spatial distribution of observed earthquakes would provide a sharper test.



7.3 Overall Performance of the Forecasts 



A summary of all results can be found in Tables [2] and [S]. The Poisson N-test is possibly the strictest test within the present context, because none of the forecasts pass every N-test of the different periods. On the other hand, five forecasts pass all the N-tests with confidence bounds based on a negative binomial distribution (Gulia-Wiemer.ALM, Gulia-Wiemer.HALM, Werner-et-al.CSI, Werner-et-al.Hybrid and Zechar-Jordan.CSI). As we mentioned, several modelers indicated to us that their forecasts were calibrated on the moment magnitude scale. As a result, it is difficult to interpret their overpredictions beyond the obvious statement that the forecasts were poorly calibrated. The forecast Nanjo-et-al.RI is the only forecast that expects substantially fewer earthquakes than the observed sample mean, although the forecast fails the NBD N-test only for the longest of the considered target periods. Those forecasts that expect a number of shocks equal to the sample mean over their calibration period predict the number of quakes well, as should be expected.

With one important exception, the conditional L-test results largely reflect the S-test results, because
the predicted magnitude distributions were consistent with observations from all but the 106-year target
period. The exception concerns the occurrence of an earthquake in a space-magnitude bin in which an
earthquake should have been impossible according to the forecast: the 1968 M_L = 6.39 Belice earthquake
happened in a spatial cell in which the forecast Meletti-et-al.MPS04 set a maximum magnitude of
M_L = 6.25. The discrepancy might be explained by the incorrect magnitude conversion that the authors
adopted, and/or it may suggest that the model's assumptions regarding the spatial variation of maximum
magnitudes need to be revised. However, if we had tested the forecast against the moment magnitude
of the Belice earthquake (M_W 6.33, according to the CPTI catalog), the forecast would still have failed,
thus pointing towards the latter explanation.

The S-test results provided the most insight into the weaknesses of the forecasts. Only five forecasts
pass all S-tests (Akinci-et-al.Hazgridx, Chan-et-al.HzaTI, Werner-et-al.Hybrid,
Zechar-Jordan.CPTI and Zechar-Jordan.Hybrid). These forecasts fit the spatial distribution of the CSI and
CPTI catalogs well, although they might overfit and perform poorly in the future. The models are also
among the simplest, especially when compared to the forecast Meletti-et-al.MPS04. However, the
forecasts Werner-et-al.CSI and Zechar-Jordan.CSI, which were calibrated on CSI data, cannot
adequately forecast the spatial locations of earthquakes during the time period before the CSI data
begins. This might indicate that the models are not smooth enough and do not sufficiently anticipate
that quiet regions can become active.

The ALM group of forecasts (Gulia-Wiemer.ALM, Gulia-Wiemer.HALM and Schorlemmer-Wiemer.ALM)
consistently fail the S-tests, and often perform worse than a uniform forecast, because
isolated earthquakes occur in extremely low-probability "background" bins that cover roughly 50% of
the region. We could not identify a common characteristic among the earthquakes that occurred in
background bins. The incurred likelihood losses cannot be compensated by the gains achieved by adequately
forecasting the majority of earthquakes. The results suggest that the ALM forecasts are overly optimistic
in ruling out earthquakes in their background bins, i.e., the models are not smooth enough.
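
The way a few background events can outweigh many well-forecast ones can be illustrated with a toy calculation. The Python sketch below assumes a joint Poisson log-likelihood over spatial cells with rates normalized to the observed number of events; it computes only the observed score, not the simulated score distribution on which a pass or fail decision would be based. The grid, the 99.9%/0.1% rate split and the event counts are all made up, and the function names are hypothetical.

```python
import math


def normalize(rates, n_events):
    """Rescale cell rates so that they sum to the observed number of events."""
    total = sum(rates)
    return [rate * n_events / total for rate in rates]


def spatial_log_likelihood(rates, counts):
    """Joint Poisson log-likelihood of per-cell counts given per-cell rates."""
    score = 0.0
    for rate, n in zip(rates, counts):
        if rate == 0.0:
            if n > 0:
                return float("-inf")   # an event where the forecast allowed none
            continue
        score += -rate + n * math.log(rate) - math.lgamma(n + 1)
    return score


def toy_scores(n_background_events, n_events=10, n_cells=100):
    """Scores of a peaked and a uniform forecast for a made-up catalog."""
    half = n_cells // 2
    # Peaked forecast: 99.9% of the rate in the first half, 0.1% in "background".
    peaked = normalize([0.999 / half] * half + [0.001 / half] * half, n_events)
    uniform = normalize([1.0] * n_cells, n_events)
    counts = [0] * n_cells
    for i in range(n_events - n_background_events):
        counts[i] = 1                  # events in well-forecast cells
    for i in range(n_background_events):
        counts[half + i] = 1           # isolated events in background cells
    return (
        spatial_log_likelihood(peaked, counts),
        spatial_log_likelihood(uniform, counts),
    )


if __name__ == "__main__":
    for n_background in (0, 2):
        print(n_background, toy_scores(n_background))
```

In this toy setting, the peaked forecast outscores the uniform one when all ten events fall in its active cells, but two events in background cells are enough to drop its score below that of the uniform forecast.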

The forecast Meletti-et-al.MPS04 also often fails the S-test because of a minority of earthquakes
that occur in low-probability regions. Almost all earthquakes that incur likelihood losses are located
offshore. But while the forecast performs substantially better onshore, a few surprising onshore
earthquake locations remain. Poor performance of a forecast for offshore earthquakes potentially raises the
problem of the "weight" of each earthquake in the testing procedure. Specifically, if a model is intended
for the practical purpose of seismic hazard assessment, then a rejection of its forecast due to offshore
earthquakes may not have the same importance as a rejection due to earthquakes in regions of higher
exposure and/or vulnerability.

Eight forecasts pass all M-tests (Gulia-Wiemer.ALM, Gulia-Wiemer.HALM, Schorlemmer-Wiemer.ALM,
Werner-et-al.CSI, Werner-et-al.Hybrid, Zechar-Jordan.CPTI, Zechar-Jordan.CSI
and Zechar-Jordan.Hybrid). Five of them are based on a simple Gutenberg-Richter distribution with
uniform b-value equal to one (Werner-et-al.CSI, Werner-et-al.Hybrid, Zechar-Jordan.CPTI,
Zechar-Jordan.CSI and Zechar-Jordan.Hybrid). This suggests that the hypothesis of a universally
applicable, uniform Gutenberg-Richter distribution with b-value equal to one [e.g., Bird and Kagan, 2004]
cannot be ruled out for the region of Italy.

Four forecasts fail the M-test during the 1901-2006 target period of the CPTI catalog. The magnitude
distributions of the forecasts Akinci-et-al.Hazgridx, Nanjo-et-al.RI, Chan-et-al.HzaTI
and Meletti-et-al.MPS04 do not adequately forecast the largest magnitudes, in particular the three
observed M_L > 7 earthquakes. In the case of the Akinci-et-al.Hazgridx and Nanjo-et-al.RI forecasts, the
reason seems to be a b-value of the Gutenberg-Richter distribution that is too large. The non-parametric
estimate of Chan-et-al.HzaTI also decays too quickly. The magnitude distribution of
Meletti-et-al.MPS04 reveals several characteristic magnitude values of elevated rates, but earthquakes also occur
between them in extremely low-probability bins. However, these results should be interpreted cautiously
because the same magnitude forecasts pass the 1950-2006 period, and because the greater uncertainty of
the data prior to 1950 arguably influences the results.
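
The effect of a b-value that is too large can be sketched numerically. The Python example below builds a binned, truncated Gutenberg-Richter distribution and compares b = 1.0 with a larger, arbitrarily chosen b = 1.3. The magnitude range (4.95 to 9.05), the bin width of 0.1, the sample magnitudes and the function names are all assumptions made for illustration, and the score is a simple sum of log bin probabilities rather than the full M-test statistic.

```python
import math


def gr_bin_probabilities(b, m_min=4.95, m_max=9.05, width=0.1):
    """Bin edges and probabilities of a truncated Gutenberg-Richter law."""
    n_bins = round((m_max - m_min) / width)
    edges = [m_min + i * width for i in range(n_bins + 1)]
    norm = 10.0 ** (-b * edges[0]) - 10.0 ** (-b * edges[-1])
    probs = [
        (10.0 ** (-b * lo) - 10.0 ** (-b * hi)) / norm
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return edges, probs


def magnitude_log_likelihood(b, magnitudes, m_min=4.95, m_max=9.05, width=0.1):
    """Sum of log bin probabilities of magnitudes inside [m_min, m_max)."""
    _, probs = gr_bin_probabilities(b, m_min, m_max, width)
    score = 0.0
    for m in magnitudes:
        index = min(int((m - m_min) / width), len(probs) - 1)
        score += math.log(probs[index])
    return score


def tail_probability(b, m_tail, m_min=4.95, m_max=9.05, width=0.1):
    """Probability that an event falls in a bin starting at or above m_tail."""
    edges, probs = gr_bin_probabilities(b, m_min, m_max, width)
    return sum(p for lo, p in zip(edges[:-1], probs) if lo >= m_tail - 1e-9)


if __name__ == "__main__":
    sample = [5.0, 5.2, 5.5, 6.1, 7.0, 7.1]   # made-up magnitudes
    for b in (1.0, 1.3):
        print(b, tail_probability(b, 6.95), magnitude_log_likelihood(b, sample))
```

Under these assumptions, the larger b-value assigns a smaller probability to the bins at and above magnitude 6.95 and a lower score to a sample that contains several large events, which is the qualitative behavior described above.
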

7.4 Value of Retrospective Evaluation

The initial submission deadline for long-term earthquake forecasts for CSEP-Italy was 1 July 2009.
Because the formal experiment was not intended to start until 1 August 2009, there was a brief period
for initial analysis and quality control of the submitted forecasts. We provided a quick summary of the
features of the forecasts and preliminary results of a retrospective evaluation to the modelers during this
period. As a result, six of the eighteen time-independent and time-dependent long-term forecasts were
modified and resubmitted before the final deadline on 1 August 2009. This initial quality control period
was therefore extremely useful, and future experiments might consider expanding and formalizing it.

The one-month period was, however, too short to evaluate the forecasts retrospectively in the
detail we present here. During the course of this study, the problem of the incorrect magnitude scaling
was discovered. Because at least 4 of the 18 forecasts are affected, a second round of submissions was
solicited for 1 November 2009, and 15 revisions (and 2 new forecasts) were submitted. This suggests
that the feedback provided to modelers based on the present study was useful and informative. The
task of converting even a relatively simple hypothesis into a testable, probabilistic earthquake forecast
should not be underestimated, and we suggest that future experiments include some form of retrospective
testing prior to final submission.

The retrospective evaluation also showed that at least the time-independent forecasts can be evaluated
in a meaningful manner and that useful information about the models can be extracted. Such information
is critical for the development of better forecasts and for the evaluation of the underlying hypotheses of
earthquake occurrence.

At the same time, retrospective evaluation cannot replace the prospective tests with zero degrees of
freedom. Given the relative robustness of the results from the retrospective evaluation, we anticipate that
the prospective experiment will provide further useful and more definitive information about the quality
of the forecasts. Most importantly, if the second round of forecast submissions contains significantly
improved forecasts with fewer technical errors, we expect to see real progress in our understanding of
earthquake predictability.

Data and Sharing Resources

We used two earthquake catalogs for this study: the Catalogo Parametrico dei Terremoti Italiani
(Parametric Catalog of Italian Earthquakes, CPTI08) [Rovida and the CPTI Working Group, 2008]
and the Catalogo della Sismicità Italiana (Catalog of Italian Seismicity, CSI 1.1) [Castello et al.,
2007; Chiarabba et al., 2005]. The particular versions of the catalogs we used are available at
http://www.cseptesting.org/regions/italy.

Model


1988-1992 


1993-1997 


1998-2002 


1985-2002


Akinci-et-al.Hazgridx 


N+ 






N+ 

N+ 


Chan-et-al.HzaTI


N+ 






Gulia-Wiemer.ALM




L, S 


L, S 


L, S 


Gulia-Wiemer.HALM


N+ 
p 


L, S 


L, S 


L, S 


Meletti-et-al.MPS04


N+ 




L, S 


N+, L, S 


Nanjo-et-al.RI 




N; 


N-,L 
L, S 


N; 
N+, L, S 


Schorlemmer-Wiemer.ALM 


N+, L, S 


L, S 


Werner-et-al.CSI 










Werner-et-al.Hybrid










Zechar-Jordan.CPTI


N+ 






N+ 


Zechar-Jordan.CSI










Zechar-Jordan.Hybrid


N+ 






N+ 



Table 2: Summary results of the forecast tests obtained using the CSI catalog. For each model and each experiment time 
period, the tests which the forecast failed are denoted, using a 5% critical significance value. For the N-test, N+ 
indicates that the forecast overpredicted the observed rate, N- indicates an underprediction; the subscript p indicates 
that the forecast only failed when assuming Poisson uncertainty, otherwise it failed under the Poisson and NBD. 



CPTI

Model  57-66  67-76  77-86  87-96  97-06  1950-2006  1901-2006
Akinci-et-al.Hazgridx  N+  N+  N+  N+
Chan-et-al.HzaTI  N+  N+  N+  N+  N+  N+
Gulia-Wiemer.ALM  L,S  L,S  N+  S  L,S  L,S
Gulia-Wiemer.HALM  L,S  L,S  N+  S  N+,L,S  L,S
Meletti-et-al.MPS04  N+,S  N+,L  N+  N+,L,S  N+,L,S  N+,L,S
Nanjo-et-al.RI  N-,L,S  N-,S  N-,L,S
Schorlemmer-Wiemer.ALM  N+,L,S  L,S  N+,L,S  N+,L,S  L,S  N+,L,S  N+,L,S
Werner-et-al.CSI  S  L,S  N-  N-,L,S
Werner-et-al.Hybrid  N-  N-
Zechar-Jordan.CPTI  N+  N+  N+  N+  N+  N+
Zechar-Jordan.CSI  S  S  N-  N-,S
Zechar-Jordan.Hybrid  N+  N+  N+  N+  N+



Table 3: Summary results of the ten-year forecast tests obtained using the CPTI catalog. For each model and each experiment 
time period, the tests which the forecast failed are denoted, using a 5% critical significance value. For the N-test, 
N+ indicates that the forecast overpredicted the observed rate, N- indicates an underprediction; the subscript p
indicates that the forecast only failed when assuming Poisson uncertainty, otherwise it failed under the Poisson and 
NBD. 

Figure 1: Left: Observed number of earthquakes in twenty non-overlapping five-year intervals in the CPTI catalog from 1907
until 2006 (inclusive) (bars), mean number of observed events (solid line) along with two-sided 95% confidence
bounds from the Poisson distribution and the negative binomial distribution (NBD). Right: Empirical
cumulative distribution function (solid black line), along with fits to the data using a Poisson distribution (solid
grey line) and an NBD (dashed black line). Also shown are the Akaike Information Criterion (AIC) values of the
fitted distributions.

[Figure 2 panels: N-Test; (Conditional) L-Test. Panel heading: CSI 5 year period 1998-2002. Axis labels: Log-likelihood (rate-space-magnitude); Log-likelihood (magnitude).]

Figure 2: Results of the (a) N-test, (b) unconditional and conditional L-tests, (c) S-test and (d) M-test of the 5-year time- 
independent forecasts using the 5-year target period from 1998 to 2002 of data from the CSI earthquake catalog. 
Red and green symbols indicate rejected and passed forecasts, respectively. In (a), green symbols with red edges 
indicate that the Poisson forecast was rejected while the NBD forecast was passed. In (b), green symbols with red 
edges indicate that only one of the two L-tests was passed. Black crosses: (a) expected number of earthquakes, (b) 
expected unconditional or conditional log-likelihood score, (c) expected spatial log-likelihood score, (d) expected 
magnitude log-likelihood score, assuming the forecast is correct. Black bars: 95% confidence bounds of the model 
forecast assuming a Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast 
assuming a negative binomial distribution. In (b), black (grey) bars denote 95% confidence bounds of the conditional 
(unconditional) likelihood score. Vertical lines in S-test figures indicate the likelihood score of a spatially uniform 
model. 

[Figure 3 panels: N-Test; L-Test: 1988-1992; L-Test: 1993-1997. Panel heading: CSI 5 year periods 1988-1997.]

Figure 3: Results of the (a) N-test, (b-c) conditional L-test, (d-e) S-test and (f-g) M-test of the 5-year time-independent
forecasts using two separate 5-year target periods of data from the CSI earthquake catalog. Red and green symbols
indicate rejected and passed forecasts, respectively. Green symbols with red edges indicate that the Poisson forecast
was rejected while the NBD forecast was passed. Black crosses: (a) expected number of earthquakes, (b-c) expected
conditional log-likelihood score, (d-e) expected spatial log-likelihood score, (f-g) expected magnitude log-likelihood
score, assuming the forecast is correct. Black bars: 95% confidence bounds of the model forecast assuming a
Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast assuming a negative
binomial distribution. Vertical lines in S-test figures indicate the likelihood score of a spatially uniform model.

[Figure 5 panels: Conditional L-Test. Panel heading: CSI 18 year period 1985-2002. Axis labels: Log-likelihood (rate-space-magnitude); Log-likelihood (space); Log-likelihood (magnitude).]

Figure 5: Results of the (a) N-test, (b) conditional L-test, (c) S-test and (d) M-test of the scaled 10-year time-independent
forecasts using the 18-year target period from 1985 to 2002 of data from the CSI earthquake catalog. Red and
green symbols indicate rejected and passed forecasts, respectively. Green symbols with red edges indicate that
the Poisson forecast was rejected while the NBD forecast was passed. Black crosses: (a) expected number of
earthquakes, (b) expected conditional log-likelihood score, (c) expected spatial log-likelihood score, (d) expected
magnitude log-likelihood score, assuming the forecast is correct. Black bars: 95% confidence bounds of the model
forecast assuming a Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast
assuming a negative binomial distribution. Vertical lines in S-test figures indicate the likelihood score of a spatially
uniform model.

Figure 6: Results of the (a) N-test, (b) conditional L-test, (c) S-test and (d) M-test of the scaled 10-year time-independent
forecasts using the 57-year target period from 1950 to 2006 of data from the CPTI earthquake catalog. Red and
green symbols indicate rejected and passed forecasts, respectively. Green symbols with red edges indicate that
the Poisson forecast was rejected while the NBD forecast was passed. Black crosses: (a) expected number of
earthquakes, (b) expected conditional log-likelihood score, (c) expected spatial log-likelihood score, (d) expected
magnitude log-likelihood score, assuming the forecast is correct. Black bars: 95% confidence bounds of the model
forecast assuming a Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast
assuming a negative binomial distribution. Vertical lines in S-test figures indicate the likelihood score of a spatially
uniform model.

Figure 7: Results of the (a) N-test, (b) conditional L-test, (c) S-test and (d) M-test of the scaled 10-year time-independent
forecasts using the 106-year target period from 1901 to 2006 of data from the CPTI earthquake catalog. Red
and green symbols indicate rejected and passed forecasts, respectively. Green symbols with red edges indicate
that the Poisson forecast was rejected while the NBD forecast was passed. Black crosses: (a) expected number of
earthquakes, (b) expected conditional log-likelihood score, (c) expected spatial log-likelihood score, (d) expected
magnitude log-likelihood score, assuming the forecast is correct. Black bars: 95% confidence bounds of the model
forecast assuming a Poisson distribution. In (a), grey bars denote 95% confidence bounds of the model forecast
assuming a negative binomial distribution. Vertical lines in S-test figures indicate the likelihood score of a spatially
uniform model.

Figure 8: Magnitude distributions and likelihood scores of the four models that failed the M-test on the 1901-2006 CPTI
target period, plotted against magnitude bin: (a) observed and predicted histograms, (b) bin-wise log-likelihood ratio
of the models against a pure Gutenberg-Richter (GR) model with b-value equal to one, (c) cumulative log-likelihood ratio.

Appendix: Negative-Binomial Forecasts



To create NBD forecasts, we used each forecast's total expected rate as the average of the distribution,
and we fixed the variance of the forecast to be equal to the observed sample variance from the CPTI
catalog (estimated in section 4.3). Thus, for five-year experiments, we use $\sigma^2_{5\mathrm{yr}} = 23.73$, while for ten-year
experiments, we use $\sigma^2_{10\mathrm{yr}} = 64.54$.

For longer time periods (e.g., the durations of the CSI and CPTI catalogs), for which we cannot directly
estimate the sample variance, we used the property that the variance of a finite sum of uncorrelated
random variables is equal to the sum of their variances. We treated the numbers of observed earthquakes
as uncorrelated random variables, meaning that we assumed that the numbers of observed earthquakes
in adjacent time intervals are independent of each other. This is likely to be a better approximation for
the ten-year intervals. We computed the variance $\sigma^2(T)$ over some finite interval of $T$ years from
$\sigma^2_{10\mathrm{yr}}$ using

$$\sigma^2(T) = \frac{T}{10}\,\sigma^2_{10\mathrm{yr}}. \qquad (7)$$



Table 4 lists the estimated and calculated variances for the various time intervals we used in this
study. The NBD parameters, if needed, can be estimated from equations Q and (5). Because the direct
estimate of $\sigma^2_{10\mathrm{yr}}$ is larger than twice $\sigma^2_{5\mathrm{yr}}$, it seems that there may be correlations at the five-year time
scale. Alternatively, the sample size may be too small, because the 95% confidence intervals are large.

Table 4: Estimated variances of the numbers of observed earthquakes for different time intervals.
*The variance was estimated directly from the catalog. Others were computed using equation (7).

Time Interval T [yrs]    Estimated $\sigma^2(T)$
5                        23.73*
10                       64.54*
18                       116.17
57                       367.88
106                      684.12
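
The computed entries of Table 4 can be reproduced with a short Python sketch, assuming that the variance over T years is the ten-year sample variance scaled by T/10, as implied by summing the variances of uncorrelated ten-year counts. The constant and function names are hypothetical; the two sample variances are the values quoted in this appendix.

```python
SIGMA2_5YR = 23.73    # sample variance of five-year counts, from the text
SIGMA2_10YR = 64.54   # sample variance of ten-year counts, from the text


def variance_for_interval(t_years, sigma2_10yr=SIGMA2_10YR):
    """Variance of the count over t_years, summing uncorrelated ten-year blocks."""
    return t_years / 10.0 * sigma2_10yr


if __name__ == "__main__":
    for t in (18, 57, 106):
        print(t, round(variance_for_interval(t), 2))
    # Uncorrelated five-year counts would give a ten-year variance of 2 * 23.73.
    print(SIGMA2_10YR > 2 * SIGMA2_5YR)
```

The last line checks the observation made above: the directly estimated ten-year variance exceeds twice the five-year variance, which is what one would not expect if five-year counts were uncorrelated.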